information discrepancy for distillation. γ controls the proportion of selected discrepant proposal pairs and is further validated in Section 6.5.4.
For each iteration, we first solve the inner-level optimization, i.e., the selection of proposals, by exhaustive sorting [249]; we then solve the upper-level optimization, i.e., distilling the selected pairs, using the entropy distillation loss discussed in Section 6.5.3. Since only a small number of proposals are involved, the inner-level optimization is relatively efficient.
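As a concrete illustration, the following minimal sketch (PyTorch) shows the inner-level step under the assumption that each student/teacher proposal pair has already been assigned a scalar information-discrepancy score; the function name and the toy scores are ours, not part of the original implementation.

```python
import torch

def select_discrepant_pairs(disc_scores: torch.Tensor, gamma: float) -> torch.Tensor:
    """Inner-level step: exhaustively sort proposal pairs by their information
    discrepancy and keep the top-gamma fraction (returns indices of kept pairs)."""
    k = max(1, int(round(gamma * disc_scores.numel())))
    order = torch.argsort(disc_scores, descending=True)  # exhaustive sorting
    return order[:k]

# Toy usage: 8 candidate student/teacher proposal pairs, keep the 60% most discrepant.
scores = torch.tensor([0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6])
print(select_discrepant_pairs(scores, gamma=0.6))  # tensor([0, 5, 3, 7, 2])
```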
6.5.3 Entropy Distillation Loss
After selecting a specific number of proposals, we crop the features based on the proposals we obtained. Most SOTA detection models are based on Feature Pyramid Networks (FPN) [143], which significantly improve the robustness of multi-scale detection. For the Faster-RCNN framework in this chapter, we resize the proposals and crop the corresponding features from each stage of the neck feature maps. For the SSD framework, we generate the proposals from the regression layer and crop the features from the feature map with the largest spatial size.
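Below is a minimal sketch of this cropping step using torchvision's roi_align; the helper name, the 7×7 output size, and the toy shapes are illustrative assumptions rather than the exact implementation.

```python
import torch
from torchvision.ops import roi_align

def crop_proposal_features(neck_feats, proposals, image_size, out_size=7):
    """Crop (resize) proposal regions from each neck/FPN feature map.

    neck_feats: list of (B, C, H_i, W_i) feature maps from the neck.
    proposals:  list of per-image (N_j, 4) boxes in image coordinates (x1, y1, x2, y2).
    image_size: (H_img, W_img), used to derive each level's spatial scale.
    """
    crops = []
    for feat in neck_feats:
        scale = feat.shape[-1] / image_size[-1]  # feature stride relative to the image
        crops.append(roi_align(feat, proposals, output_size=out_size,
                               spatial_scale=scale, aligned=True))
    return crops  # one (sum_j N_j, C, out_size, out_size) tensor per pyramid level

# Toy usage: two FPN levels, one image, two proposals.
feats = [torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32)]
boxes = [torch.tensor([[10., 10., 100., 120.], [50., 40., 200., 180.]])]
print([c.shape for c in crop_proposal_features(feats, boxes, image_size=(512, 512))])
```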
Then we formulate the entropy distillation process as follows:
\[
\max_{R^s_n} \; p(R^s_n \mid R^t_n). \tag{6.87}
\]
Eq. 6.87 is the upper level of the bi-level optimization, where m has already been solved in the inner level and is therefore omitted. We rewrite Eq. 6.87 to obtain our entropy distillation loss as
\[
\mathcal{L}_P(\mathbf{w}, \alpha; \gamma) = (R^s_n - R^t_n) + \mathrm{Cov}(R^s_n, R^t_n)^{-1}(R^s_n - R^t_n)^2 + \log\big(\mathrm{Cov}(R^s_n, R^t_n)\big), \tag{6.88}
\]
where $\mathrm{Cov}(R^s_n, R^t_n) = \mathbb{E}(R^s_n R^t_n) - \mathbb{E}(R^s_n)\mathbb{E}(R^t_n)$ denotes the covariance matrix.
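As an illustration only, the sketch below implements one possible reading of Eq. 6.88 for a single proposal pair, assuming the cropped features are flattened to vectors and Cov(·,·) is estimated as a scalar cross-covariance over feature elements; the helper name and the small ε used to keep the inverse and logarithm well defined are our assumptions, not the authors' implementation.

```python
import torch

def entropy_distill_loss(rs: torch.Tensor, rt: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """One reading of Eq. 6.88 for a single student/teacher proposal pair.

    rs, rt: flattened student / teacher features cropped from the same proposal.
    Cov(rs, rt) = E[rs*rt] - E[rs]E[rt] is estimated here as a scalar
    cross-covariance over feature elements (an assumption of this sketch).
    """
    cov = ((rs * rt).mean() - rs.mean() * rt.mean()).abs().clamp_min(eps)
    diff = (rs - rt).mean()
    return diff + diff.pow(2) / cov + torch.log(cov)

# Toy usage on one flattened 256x7x7 cropped proposal pair.
rs, rt = torch.randn(256 * 7 * 7), torch.randn(256 * 7 * 7)
print(entropy_distill_loss(rs, rt))
```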
Hence, we train the 1-bit student model end-to-end; the total loss for distilling the student model is defined as
\[
\mathcal{L} = \mathcal{L}_{GT}(\mathbf{w}, \alpha) + \lambda \mathcal{L}_P(\mathbf{w}, \alpha; \gamma) + \mu \mathcal{L}_R(\mathbf{w}, \alpha), \tag{6.89}
\]
where $\mathcal{L}_{GT}$ is the detection loss derived from the ground-truth labels, and $\mathcal{L}_R$ is defined in Eq. 6.80.
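For completeness, a one-line sketch of how the three terms in Eq. 6.89 are combined, using the hyper-parameter values selected in Section 6.5.4; the individual loss terms are placeholders computed elsewhere.

```python
def total_loss(l_gt, l_p, l_r, lam=0.4, mu=1e-4):
    """Eq. 6.89: detection loss plus weighted entropy distillation and binarization losses."""
    return l_gt + lam * l_p + mu * l_r
```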
6.5.4 Ablation Study
Selecting the hyper-parameters. As mentioned above, we select the hyper-parameters λ, γ, and μ in this part. First, we select μ, which controls the binarization process. As plotted in Fig. 6.17 (a), we fine-tune μ in four situations: raw BiRes18, and BiRes18 distilled by Hint [33], FGFI [235], and our IDa-Det, respectively. In general, performance first increases and then decreases as μ increases. On raw BiRes18 and IDa-Det BiRes18, the 1-bit student performs best when μ is set to 1e-4, whereas 1e-3 works better for the Hint- and FGFI-distilled 1-bit students. Therefore, we set μ to 1e-4 for the extended ablation study. Figure 6.17 (b) shows that performance first increases and then decreases as λ grows from left to right. In general, IDa-Det performs better with λ set to 0.4 or 0.6. Varying γ, we find that {λ, γ} = {0.4, 0.6} boosts the performance of IDa-Det the most, achieving 76.9% mAP on VOC test2007. Based on the ablation study above, we set the hyper-parameters λ, γ, and μ to 0.4, 0.6, and 1e-4 for the experiments in this chapter.
Effectiveness of components. We first compare our information discrepancy-aware (IDa) proposal selection method with other proposal selection methods: Hint [33] (using the neck feature without a region mask) and FGFI [235]. We show the effectiveness of IDa on the two-stage Faster-RCNN in Table 6.5. In Faster-RCNN, the introduction of IDa improves mAP